As a nerd, I’ve always been fascinated by dating statistics. I’m interested in exploring how the features of an area affect single ratios.
In the dating expert community, there is a widespread belief that the area you are in drastically affects dating. For example, a common theory is that an uneven ratio of women to men leads to a higher percentage of single women. Also, in areas with a younger population, e.g. college towns, there’s a much higher percentage of single people.
In this data exploration we’ll look at population data (provided by the kind people at towncharts.com) and how an area’s features drive single percentage.
We’ll start by analyzing single variables in the data set to better understand population data.
First let’s see the structure and variables that are in our population data:
## 'data.frame': 45569 obs. of 60 variables:
## $ Row : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Area_Name : Factor w/ 35220 levels "Aaronsburg CDP-Centre County",..: 318 319 664 719 18119 3584 7813 8017 10047 12874 ...
## $ State : Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Family_Size : num 3.14 4.82 3.48 4.05 4.72 ...
## $ Ratio_Male : num 0.665 0.597 0.512 0.511 0.527 ...
## $ Ratio_Female : num 0.335 0.403 0.488 0.489 0.473 ...
## $ X.20 : num 0.143 0.229 0.277 0.368 0.39 ...
## $ X20s : num 0.16 0.16 0.181 0.159 0.163 ...
## $ X30s : num 0.144 0.16 0.142 0.14 0.121 ...
## $ X40s : num 0.234 0.169 0.128 0.1 0.111 ...
## $ X50s : num 0.194 0.162 0.138 0.131 0.111 ...
## $ X60s : num 0.1017 0.0962 0.0832 0.0632 0.0641 ...
## $ X70. : num 0.0239 0.0227 0.0513 0.0395 0.0408 ...
## $ Pop_White : num 0.215 0.316 0.649 0.129 0.119 ...
## $ Pop_Hispanic : num 0.1205 0.1096 0.0859 0.0165 0.0185 ...
## $ Pop_Black : num 0.09201 0.04733 0.05854 0.00264 0.00849 ...
## $ Pop_Asian : num 0.3054 0.3716 0.088 0.0277 0.0103 ...
## $ Pop_Other : num 0.387 0.265 0.204 0.841 0.862 ...
## $ Families : num 0.614 0.605 0.662 0.718 0.752 ...
## $ Non.Family : num 0.386 0.395 0.338 0.282 0.248 ...
## $ Family_Married_Couple: num 0.592 0.709 0.73 0.543 0.615 ...
## $ Family_Male_Head : num 0.1675 0.1395 0.0845 0.1646 0.155 ...
## $ Family_Female_Head : num 0.241 0.151 0.185 0.293 0.23 ...
## $ Living_Alone : num 0.744 0.747 0.76 0.76 0.654 ...
## $ Not.Living_Alone : num 0.744 0.747 0.76 0.76 0.654 ...
## $ Married_Total : num 0.538 0.44 0.493 0.405 0.419 ...
## $ Never.Married_Total : num 0.337 0.389 0.353 0.481 0.473 ...
## $ Divorced : num 0.093 0.1261 0.1185 0.0524 0.0664 ...
## $ Widowed : num 0.032 0.0447 0.0357 0.061 0.0415 ...
## $ Ratio_Single : num 0.462 0.56 0.507 0.595 0.581 ...
## $ Men.Never_Married : num 0.351 0.411 0.398 0.528 0.507 ...
## $ Men.Divorced : num 0.0789 0.1074 0.1014 0.0494 0.0688 ...
## $ Men.Widowed : num 0.0115 0.028 0.0174 0.0311 0.0237 ...
## $ Women.Never_Married : num 0.306 0.351 0.306 0.43 0.434 ...
## $ Women.Divorced : num 0.1224 0.1591 0.1364 0.0558 0.0637 ...
## $ Women.Widowed : num 0.0747 0.0741 0.0548 0.0936 0.0615 ...
## $ Men.Single_Base : int 826 1412 52698 272 2914 188 407 912 18589 419 ...
## $ Men.Single_18.24 : num 0.251 0.279 0.326 0.261 0.308 ...
## $ Men.Single_25.29 : num 0.149 0.127 0.18 0.221 0.158 ...
## $ Men.Single_30.34 : num 0.1138 0.1006 0.1047 0.1103 0.0858 ...
## $ Men.Single_35.39 : num 0.0847 0.0999 0.0769 0.0772 0.0951 ...
## $ Men.Single_40.44 : num 0.1308 0.0942 0.0686 0.0588 0.0693 ...
## $ Men.Single_45.49 : num 0.0944 0.102 0.0669 0.0625 0.0968 ...
## $ Men.Single_50.59 : num 0.13 0.162 0.134 0.191 0.142 ...
## $ Men.Single_60.65 : num 0.0472 0.0354 0.0435 0.0184 0.045 ...
## $ Women.Single_Base : int 418 803 43605 225 2313 93 351 765 12822 280 ...
## $ Women.Single_18.24 : num 0.263 0.208 0.287 0.324 0.326 ...
## $ Women.Single_25.29 : num 0.12 0.157 0.155 0.102 0.163 ...
## $ Women.Single_30.34 : num 0.0813 0.0809 0.1071 0.1333 0.0856 ...
## $ Women.Single_35.39 : num 0.055 0.0971 0.067 0.0667 0.1094 ...
## $ Women.Single_40.44 : num 0.0694 0.0772 0.0723 0.04 0.0441 ...
## $ Women.Single_45.49 : num 0.1196 0.0872 0.0811 0.0756 0.0605 ...
## $ Women.Single_50.59 : num 0.184 0.186 0.173 0.182 0.15 ...
## $ Women.Single_60.65 : num 0.1077 0.1071 0.0568 0.0756 0.061 ...
## $ POP_2000 : int 3141 5561 291826 1450 15563 997 1826 4847 97581 2508 ...
## $ Households_2000 : int 747 1929 113032 724 5195 969 1771 2427 41783 1631 ...
## $ POP_2015 : int 3304 5684 299107 1518 16258 970 2060 4979 99705 2560 ...
## $ Households_2015 : int 894 1944 114083 746 5191 917 1724 2422 41697 1590 ...
## $ Population_Change : num 0.0519 0.0221 0.0249 0.0469 0.0447 ...
## $ Population_Density : num 0.4732 1.2947 175.2542 0.0797 0.755 ...
As we see, we have 60 variables with 45,569 observations. That’s a lot of rows!
Looking through the data it looks like our variables focus on state, gender ratios, age, ethnicity, and what we want to focus on, single ratios.
So, to start, let’s see the distribution of single people in the US:
[Figure: histogram, "Ratio of Singles per Area"]
This is fascinating: the data looks roughly normally distributed, with a mean around 45%. Let’s calculate the summary statistics real quick:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3688 0.4404 0.4438 0.5161 1.0000
This is also fascinating: the mean is 44.38% singles for any given area in the US, with a first quartile of 36.88% and a third quartile of 51.61%.
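As a quick sanity check on what Ratio_Single actually measures, here is a minimal sketch using only the first five rows printed in the str() output. It assumes (the data set does not say so explicitly) that "single" means never married plus divorced plus widowed; `pop_head` and `rebuilt` are names made up for this illustration.

```r
# First five rows of four columns, copied from the str() output above.
pop_head <- data.frame(
  Never.Married_Total = c(0.337, 0.389, 0.353, 0.481, 0.473),
  Divorced            = c(0.093, 0.1261, 0.1185, 0.0524, 0.0664),
  Widowed             = c(0.032, 0.0447, 0.0357, 0.061, 0.0415),
  Ratio_Single        = c(0.462, 0.56, 0.507, 0.595, 0.581)
)

# ASSUMPTION: single = never married + divorced + widowed.
rebuilt <- with(pop_head, Never.Married_Total + Divorced + Widowed)

# The printed values are rounded, so compare with a small tolerance.
matches <- abs(rebuilt - pop_head$Ratio_Single) < 0.002
print(matches)

# summary() produces the same six-number layout shown above.
print(summary(pop_head$Ratio_Single))
```

For these five rows the three marital-status columns do add up to Ratio_Single within rounding, which supports the assumption but does not prove it for the whole data set.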
Next, a variable of interest is “State”. Let’s create a simple bar chart counting the observations from each state:
From our data, it looks like the states that have the most rows are IL, MN, and PA.
This by itself isn’t particularly useful but it’s interesting to note which states have the most area observations.
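The same counts can be read off without a plot. This is a minimal sketch on a toy stand-in for the State column (the six values below are hypothetical, only the state codes come from the data):

```r
# Toy stand-in for the State column; the real data has 51 levels.
state <- factor(c("AK", "AK", "AL", "AR", "AR", "AR"))

# table() counts rows per level; sort() puts the largest count first.
state_counts <- sort(table(state), decreasing = TRUE)
print(state_counts)

# Names of the two states with the most area rows.
top_states <- names(state_counts)[1:2]
print(top_states)
```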
As discussed earlier, there is a theory that gender ratios affect single percentage in an area. To explore this, we’ll look at the Ratio_Male and Ratio_Female variables. Let’s make some histogram plots of those two variables:
[Figure: histogram, "Ratio of Men in Area"]
[Figure: histogram, "Ratio of Female in Area"]
This is what we expected: we see a very narrow, roughly normal distribution centered around 50% for both plots.
What’s interesting, though, is that the ratio of males is slightly below 50% and the ratio of females is slightly above 50%.
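For readers who want to reproduce a histogram like these, here is a minimal sketch. It assumes ggplot2 as the plotting library (the text does not name one), uses only the first five Ratio_Male values from the str() output, and picks an arbitrary bin width.

```r
library(ggplot2)

# Toy stand-in: first five Ratio_Male values from the str() output.
pop_head <- data.frame(Ratio_Male = c(0.665, 0.597, 0.512, 0.511, 0.527))

# Every layer, including ggtitle(), is chained with `+` at the END of a line.
# A ggtitle() call evaluated by itself only prints a labels object to the
# console and leaves the plot untitled.
p <- ggplot(pop_head, aes(x = Ratio_Male)) +
  geom_histogram(binwidth = 0.01) +  # binwidth is an arbitrary choice here
  ggtitle("Ratio of Men in Area")

print(p)
```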
Let’s run some summary statistics real quick:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4727 0.4947 0.4991 0.5215 1.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4785 0.5053 0.5009 0.5273 1.0000
As we can see, the mean male ratio is 49.91% and the mean female ratio is 50.09%, so averaged across areas there is a slight gender imbalance in the US.
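One caveat: summary() averages over areas, not over people, so a tiny town counts as much as a big city. A minimal sketch of the difference, using the first five rows of Ratio_Male and POP_2015 from the str() output (weighting by POP_2015 is my choice for the illustration, not something the original analysis did):

```r
# First five rows, copied from the str() output.
pop_head <- data.frame(
  Ratio_Male = c(0.665, 0.597, 0.512, 0.511, 0.527),
  POP_2015   = c(3304, 5684, 299107, 1518, 16258)
)

# Unweighted: every area counts equally (this is what summary() reports).
area_mean <- mean(pop_head$Ratio_Male)

# Weighted: every person counts equally.
pop_weighted <- weighted.mean(pop_head$Ratio_Male, w = pop_head$POP_2015)

print(c(area_mean = area_mean, pop_weighted = pop_weighted))
```

In these five rows the weighted figure is pulled toward the one large area, so the two averages differ noticeably.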
Next, let’s dive into single percentage by gender. To do this we’re going to need to create some new variables in our data, Ratio_Men_Single and Ratio_Women_Single:
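The original code for this step is not shown, so the following is only a sketch of one plausible definition: single men (or women) as a share of the area's total population. That definition, the choice of POP_2015 as the denominator, and the `pop_head` name are all assumptions; the input values are the first five rows from the str() output.

```r
# First five rows, copied from the str() output.
pop_head <- data.frame(
  Men.Single_Base   = c(826, 1412, 52698, 272, 2914),
  Women.Single_Base = c(418, 803, 43605, 225, 2313),
  POP_2015          = c(3304, 5684, 299107, 1518, 16258)
)

# ASSUMPTION: ratio = count of single men (women) / total 2015 population.
pop_head$Ratio_Men_Single   <- pop_head$Men.Single_Base / pop_head$POP_2015
pop_head$Ratio_Women_Single <- pop_head$Women.Single_Base / pop_head$POP_2015

print(round(pop_head[, c("Ratio_Men_Single", "Ratio_Women_Single")], 3))
```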
As we did with our other variables, let’s plot a histogram of the data and see the distribution:
[Figure: histogram, "Ratio of Single Men in Area"]
The distribution of Ratio_Men_Single looks like a normal distribution, but there appears to be a little bit of a right skew. Let’s run some summary statistics real quick:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.09846 0.12752 0.13397 0.15900 1.00000
We have a mean of 13.40%. I’m a little bit surprised that number is so low!
Now let’s look at the ratio of single women:
[Figure: histogram, "Ratio of Single Women in Area"]
Just like the Men Single Ratio graph it looks like the distribution is Normal but with a little bit of a right skew.
Let’s run some summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.08258 0.11355 0.11713 0.14659 1.00000
This is interesting: the mean is 11.71%, which is less than the male single ratio. Curious!
Let’s keep those statistics in the back of our mind and explore the next variable we are interested in, age.
A potential hypothesis in the dating community is that in areas where the population is younger, more people are single.
To explore this, let’s first create a new variable, Ratio_Under_30:
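The code for this step is not shown either. A minimal sketch, assuming X.20 is the under-20 share (R turns a column header such as "<20" into "X.20" on import) and X20s is the share aged 20 to 29, using the first five rows from the str() output:

```r
# First five rows of the two youngest age columns, from the str() output.
pop_head <- data.frame(
  X.20 = c(0.143, 0.229, 0.277, 0.368, 0.39),
  X20s = c(0.16, 0.16, 0.181, 0.159, 0.163)
)

# ASSUMPTION: under 30 = under 20 + twenties.
pop_head$Ratio_Under_30 <- pop_head$X.20 + pop_head$X20s

print(pop_head)
```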
Now let’s check the distribution of our new Ratio_Under_30 variable:
[Figure: histogram, "Ratio Under 30"]
This looks like a normal distribution, with an average that appears to be around 37%. Let’s run the summary statistics real quick:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3072 0.3624 0.3597 0.4153 1.0000
As we see, the mean is 35.97%, meaning that in the average US area 35.97% of the population is under 30.
For the rest of this exploration let’s look at the single ratio and try to understand if there are variables in our data set that correlate with a higher single percentage.
First let’s see the single ratio for each state in a box plot to see if there are any states that stand out:
From the chart, it looks like AK, MS, and NM have the highest ratios of single people.
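Reading the top states off a box plot by eye is error-prone, so a ranked table of medians can back it up. A minimal sketch on hypothetical toy data (the six Ratio_Single values below are made up; only the column names and state codes come from the data set):

```r
# Hypothetical toy data: two areas for each of three states.
pop_toy <- data.frame(
  State        = c("AK", "AK", "AL", "AL", "AR", "AR"),
  Ratio_Single = c(0.46, 0.56, 0.40, 0.44, 0.42, 0.48)
)

# Median single ratio per state (the line inside each box of a box plot).
state_median <- aggregate(Ratio_Single ~ State, data = pop_toy, FUN = median)

# Rank states from highest to lowest median.
state_median <- state_median[order(state_median$Ratio_Single, decreasing = TRUE), ]
print(state_median)
```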
Now let’s go one step further and see single ratio by gender by state in a scatter plot.
Interesting: it looks like there are some states that definitely have more single men than women. DC is leading the pack with single men, but that value might be skewed because DC is technically a district and not a state. As far as states go, AK is leading the pack.
Now let’s look at states with the most single women:
This plot is also interesting: it appears MS is leading the pack with the highest ratio of single women. What’s also interesting is that North Dakota has the lowest ratio of single women. I wonder what drives the differences in these two states…
Let’s circle back to our hypothesis that gender ratio affects the single ratio of that gender.
To dig into this, we can use scatterplots to see if there is visual evidence of gender ratio affecting the amount of singles for that gender.
First we’ll look at men, and we’re going to subset the data to remove the extreme outliers where Ratio_Men_Single = 0, as I’m not sure how that is possible:
Interesting! The smoothed line says that as the ratio of males in an area increases, the ratio of single men also increases. From visually looking at the data, this seems most pronounced in the tails.
Now we’ll look at women, and we’re going to subset the data to remove the extreme outliers where Ratio_Women_Single = 0 (as again I’m not sure how that is possible).
This is interesting: as we saw in the single men graph, the smoothed line shows that as a gender’s ratio increases, there are more singles of that gender.
Also, from an eyeball test, this looks most pronounced at the tails of the graph.
There could be a story here: in areas where there are large gender imbalances, there seems to be a higher ratio of singles for the overrepresented gender. This of course makes sense since human coupling is 1 to 1. If there are 30 women and 70 men, and every woman couples with one man, then there are going to be 40 men without partners, hence a higher ratio of male singles.
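The 30/70 arithmetic can be written as a tiny helper. This is just an illustration of the one-to-one pairing argument; `unpartnered` is a hypothetical function name, not part of the analysis.

```r
# Count people left without a partner under strict one-to-one pairing.
unpartnered <- function(men, women) {
  pairs <- min(men, women)  # every member of the smaller group pairs up
  c(men = men - pairs, women = women - pairs)
}

left_over <- unpartnered(men = 70, women = 30)
print(left_over)

# Share of the men in this area who end up single.
share_men_single <- left_over[["men"]] / 70
print(share_men_single)
```

With 70 men and 30 women, 40 men are left over, which is 4 out of every 7 men in that area.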
Let’s calculate the correlation between gender ratios and gender single statistics real quick to see what that yields:
##
## Pearson's product-moment correlation
##
## data: Ratio_Female and Ratio_Women_Single
## t = 92.236, df = 44243, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3937479 0.4093784
## sample estimates:
## cor
## 0.4015924
##
## Pearson's product-moment correlation
##
## data: Ratio_Male and Ratio_Men_Single
## t = 113.83, df = 44754, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4666269 0.4809960
## sample estimates:
## cor
## 0.473843
Interesting: the correlation between the ratio of males and single males is 0.473843. That is a moderate positive correlation, so it’s not super strong but it’s not minimal either.
The correlation between the ratio of females and single women is 0.4015924, again moderate: not a strong indicator, but not negligible either.
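For anyone following along, this is roughly how such a test can be run. The sketch below uses simulated data, not the real population file: `toy`, the seed, and every coefficient in the simulation are hypothetical, and only the column names and the "drop the zeros" step mirror the text.

```r
set.seed(42)
n <- 500

# Simulated stand-in for the real data (all numbers here are made up).
toy <- data.frame(Ratio_Male = pmin(pmax(rnorm(n, 0.5, 0.05), 0), 1))
toy$Ratio_Men_Single <- pmax(
  0.13 + 0.4 * (toy$Ratio_Male - 0.5) + rnorm(n, 0, 0.04),
  0
)

# Drop areas where the single ratio is exactly 0, as described above.
men <- subset(toy, Ratio_Men_Single > 0)

# Pearson correlation with its confidence interval and p-value.
res <- cor.test(men$Ratio_Male, men$Ratio_Men_Single, method = "pearson")
print(res$estimate)
print(res$conf.int)
```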
From looking at the graph, the trends seem to be most pronounced at the tails, so let’s try something real quick!
##
## Pearson's product-moment correlation
##
## data: Ratio_Female and Ratio_Women_Single
## t = 30.782, df = 1326, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6130669 0.6758904
## sample estimates:
## cor
## 0.6455695
##
## Pearson's product-moment correlation
##
## data: Ratio_Male and Ratio_Men_Single
## t = 37.865, df = 1442, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6792489 0.7310512
## sample estimates:
## cor
## 0.7060935
This is fascinating! In areas of extreme gender imbalance the correlation is much higher. For example, if we look at places where Ratio_Male < .375 or Ratio_Male > .625, we find that the correlation of Ratio_Male to single men is 0.7060935, which is much higher than the original correlation of 0.473843. The same holds true for Ratio_Female compared to single women, which is now 0.6455695 compared to 0.4015924.
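Here is a minimal sketch of the tails-only comparison on simulated data. Everything in `toy` is hypothetical (the seed, the slope, the noise level); only the 0.375 and 0.625 cutoffs and the column names come from the text. Note the filter needs "or" (`|`), since no value can be below 0.375 and above 0.625 at once.

```r
set.seed(1)
n <- 2000

# Simulated stand-in: one linear relationship with the SAME slope everywhere.
toy <- data.frame(Ratio_Male = runif(n, 0.2, 0.8))
toy$Ratio_Men_Single <- 0.13 + 0.4 * (toy$Ratio_Male - 0.5) + rnorm(n, 0, 0.05)

# Keep only areas with an extreme gender ratio.
tails <- subset(toy, Ratio_Male < 0.375 | Ratio_Male > 0.625)

cor_all   <- cor(toy$Ratio_Male, toy$Ratio_Men_Single)
cor_tails <- cor(tails$Ratio_Male, tails$Ratio_Men_Single)
print(c(all = cor_all, tails = cor_tails))
```

In this toy data the underlying relationship is identical in the middle and in the tails, yet the tails-only correlation still comes out higher, because keeping only extreme values widens the spread of Ratio_Male relative to the noise. So part of the jump seen in the real data may be mechanical rather than a sign that the effect itself is stronger in imbalanced areas.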
We have quite a few variables that we can use to understand their effect on single ratio. Let’s use a ggcor graph to understand the correlation between various variables.
First let’s subset our data to include only the variables that we want: “Ratio_Male”, “Ratio_Female”, “Ratio_Single”, “Ratio_Men_Single”, “Ratio_Women_Single”, “Ratio_Under_30”, as the computation times go way up when we perform multivariate analysis on the entire population data set:
## [1] "Ratio_Male" "Ratio_Female" "Ratio_Single"
## [4] "Ratio_Men_Single" "Ratio_Women_Single" "Ratio_Under_30"
Perfecto! We now have the variables that we care about.
Let’s now use ggcor to compare:
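The numbers behind such a plot are just a correlation matrix, which base R can compute directly. A minimal sketch on simulated data follows; `toy`, `Extra_Column`, and all simulated values are hypothetical, and only the six column names come from the output above. If the "ggcor" in the text refers to `ggcorr()` from the GGally package (an assumption on my part), the commented line at the end shows how it would be called.

```r
set.seed(7)
n <- 200

# Simulated stand-in with the six columns of interest plus one extra.
toy <- data.frame(
  Ratio_Male     = runif(n, 0.3, 0.7),
  Ratio_Under_30 = runif(n, 0.2, 0.6),
  Extra_Column   = rnorm(n)  # hypothetical column we do not want to keep
)
toy$Ratio_Female       <- 1 - toy$Ratio_Male
toy$Ratio_Men_Single   <- 0.13 + 0.4 * (toy$Ratio_Male - 0.5) + rnorm(n, 0, 0.03)
toy$Ratio_Women_Single <- 0.12 + 0.4 * (toy$Ratio_Female - 0.5) + rnorm(n, 0, 0.03)
toy$Ratio_Single       <- toy$Ratio_Men_Single + toy$Ratio_Women_Single

# Subset to the variables we care about, in the order shown above.
keep <- c("Ratio_Male", "Ratio_Female", "Ratio_Single",
          "Ratio_Men_Single", "Ratio_Women_Single", "Ratio_Under_30")
pop_subset <- toy[, keep]
print(names(pop_subset))

# Pairwise Pearson correlations between every pair of columns.
cor_matrix <- cor(pop_subset, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))

# Plot version, if GGally is installed (not run here):
# GGally::ggcorr(pop_subset, label = TRUE)
```

One thing the matrix makes obvious: Ratio_Male and Ratio_Female are perfectly negatively correlated because they sum to 1, so only one of them carries information.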
Interesting: from the ggcor table, it looks like the ratio of people under 30 doesn’t affect single ratio very much. This is worth exploring further.
As is obvious, the single ratio of a particular gender correlates with the overall single ratio, and as explored previously, gender imbalances somewhat correlate with the single ratio of that gender.
This is fun. Next we’ll look at a bunch of variables and see how they affect single ratios.
So far we’ve explored comparing two variables against one another. Now it’s time to start doing multi-variate comparisons.
In the ggcor plot it appeared that the ratio of people under 30 didn’t affect single rates, but I find this odd.
Let’s play around with the data a little more and make a plot of 3 variables all at once. We’ll start with men, and look at Ratio_Male, Ratio_Men_Single, and Ratio_Under_30.
In order for us to better plot these we’re going to need to create a new variable called “Ratio_Under_30_Bucket”:
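The bucketing code is not shown, so here is a minimal sketch of one way to do it with `cut()`. The break points (every 20 percentage points) are an assumption chosen to line up with the 80% threshold discussed below, and the input values are toy numbers.

```r
# Toy Ratio_Under_30 values (hypothetical).
ratio_under_30 <- c(0.303, 0.389, 0.458, 0.527, 0.85)

# ASSUMPTION: five equal-width buckets; include.lowest keeps exact zeros.
bucket <- cut(
  ratio_under_30,
  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
  include.lowest = TRUE
)

# How many areas fall into each bucket.
print(table(bucket))
```

The result is a factor, which makes it convenient to use as a color or facet variable in a plot.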
Great, now that we have Ratio_Under_30_Bucket, let’s create a plot:
This is fascinating! From the graph it looks like in areas where the large majority of the population is under 30 (80%+) there is a higher ratio of single men. It also looks like when that ratio is below 80% there isn’t much difference.
Let’s see what the graph says for women:
Interesting! In areas where a large majority of the population is under 30 (i.e. 80%+), there appears to be a much higher ratio of single women.
We’re going to end this project here, but there are tons of ways to continue the analysis of this project. Let’s discuss in the next section.
We started this project with a population data set from Towncharts.com.
The variable we determined to be most interesting was single ratio, so we decided to analyze how the other variables correlated to this.
We started making plots comparing single ratio to other variables, the first of which was states:
We then hypothesized that single ratios will be influenced by the gender ratio of that area.
We plotted the data to visually gauge the correlation between gender ratio and singles of that gender:
There was a clear correlation between gender ratio and singles of that gender.
We decided to do further analysis.
We decided to explore a 3 way plot showing the gender ratio, single percentage of that gender, and age buckets to understand how age correlates into all of this:
From the visuals, it looked as if areas that have a higher ratio of population under 30 also had a higher single percentage.
This left us with follow up questions.
We have quite a few more questions to answer!
Such as: does age affect the single percentage up to a certain number and then lose its effect after that age?
More analysis can be done on the correlation between age and single ratio. Also, an analysis into why some states have greater single ratios than other states.
We haven’t touched on ethnicity, which could correlate to a higher or lower single percentage.
Further, we haven’t looked at the income/wealth of particular areas and how that can affect single percentage.
That’s the fun of data exploration: the discovery never ends!